On Minimax Optimal Offline Policy Evaluation
Authors
Abstract
This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the multi-armed bandit case, establish a minimax risk lower bound, and analyze the risk of two standard estimators. It is shown, and verified in simulation, that one is minimax optimal up to a constant, while the other can be arbitrarily worse, despite its empirical success and popularity. The results are applied to related problems in contextual bandits and fixed-horizon Markov decision processes, and are also related to semi-supervised learning.
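As a concrete illustration of the setup described in the abstract (a minimal sketch, not code from the paper), the Python snippet below compares two standard off-policy estimators in a multi-armed bandit: the importance-sampling estimator, which reweights each observed reward by the ratio of target to behavior action probabilities, and the regression (plug-in) estimator, which first estimates each arm's mean reward and then averages those estimates under the target policy. All policies, reward means, and variable names are hypothetical.

```python
# Hypothetical sketch: off-policy evaluation in a multi-armed bandit.
# All policies and reward parameters below are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

K = 3                                    # number of arms
mu = np.array([0.2, 0.5, 0.8])           # true mean rewards (unknown in practice)
pi_b = np.array([0.7, 0.2, 0.1])         # behavior policy (collects the data)
pi_t = np.array([0.1, 0.2, 0.7])         # target policy to evaluate
true_value = pi_t @ mu                   # v(pi_t), the quantity to estimate

n = 10_000
arms = rng.choice(K, size=n, p=pi_b)                 # actions drawn from pi_b
rewards = rng.binomial(1, mu[arms]).astype(float)    # Bernoulli rewards

# Importance-sampling (likelihood-ratio) estimator: reweight each observed
# reward by pi_t(a) / pi_b(a), then average.
is_estimate = np.mean(pi_t[arms] / pi_b[arms] * rewards)

# Regression (plug-in) estimator: estimate each arm's mean reward from the
# data, then average the estimates under the target policy.
mu_hat = np.array([rewards[arms == a].mean() if np.any(arms == a) else 0.0
                   for a in range(K)])
reg_estimate = pi_t @ mu_hat

print(f"true value         : {true_value:.4f}")
print(f"importance sampling: {is_estimate:.4f}")
print(f"regression         : {reg_estimate:.4f}")
```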
Similar resources
Toward Minimax Off-policy Value Estimation
This paper studies the off-policy evaluation problem, where one aims to estimate the value of a target policy based on a sample of observations collected by another policy. We first consider the single-state, or multi-armed bandit, case, establish a finite-time minimax risk lower bound, and analyze the risk of three standard estimators. For the so-called regression estimator, we show that while ...
On the Minimax Optimality of Block Thresholded Wavelets Estimators for ?-Mixing Process
We propose a wavelet-based regression function estimator for a sequence of ?-mixing random variables with a common one-dimensional probability density function. Some asymptotic properties of the proposed estimator based on block thresholding are investigated. It is found that the estimators achieve optimal minimax convergence rates over large class...
Regularized Policy Iteration with Nonparametric Function Spaces
We study two regularization-based approximate policy iteration algorithms, namely REG-LSPI and REG-BRM, to solve reinforcement learning and planning problems in discounted Markov Decision Processes with large state and finite action spaces. The core of these algorithms consists of the regularized extensions of Least-Squares Temporal Difference (LSTD) learning and Bellman Residual Minimization (BRM),...
Tracking of signal and its derivatives in Gaussian white noise
For the observation model "signal + white Gaussian noise", an on-line tracking algorithm for a signal and its derivatives is proposed. The algorithm applies to a class of signals with derivatives up to the k-th order. Asymptotic optimality in the minimax sense, with respect to small noise intensity, is established.
Empirical Results on Convergence and Exploration in Approximate Policy Iteration
In this paper, we empirically investigate the convergence properties of policy iteration applied to the optimal control of systems with continuous state and action spaces. We demonstrate that policy iteration requires fewer iterations than value iteration to converge, but more function evaluations to generate cost-to-go approximations in the policy evaluation step. Two different alter...
Journal: CoRR
Volume: abs/1409.3653
Issue: -
Pages: -
Published: 2014